EDA - Predicting Project Success

This notebook is an exploratory analysis of the date for the "Predicting Project Success" project.

Member of the team:

  • An Hoang
  • Michael Sedelmeyer
  • Mark McDonald

Notebook Index

1) Notebook Setup

    Imports // Helper Functions // Read Dataset

2) Analysis

    a) Data Cleanup

    b) Counts

    c) Trends

    d) Budget Values

    e) "Missing-ness"

    f) Summary & Findings

    g) Open Items / Unanswered

Imports

Imports for function used in this notebook.

index

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').show();
 } else {
 $('div.input').hide();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [40]:
import os
import math
from datetime import datetime, timedelta

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# BOKEH IMPORTS
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import HoverTool, ColumnDataSource
output_notebook()

# Plotly imports
from plotly import express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode()


import missingno as msno
# Avoid scientific notation output in Pandas
# pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.display.float_format = '{:,.2f}'.format
import logging

# Improve resolution of output graphcis
%config InlineBackend.figure_format ='retina'
Loading BokehJS ...

Helper Functions

Functions used to accomplish repeated tasks in this notebook.

index

Bokeh Helper Functions

In [3]:
def count_chart(d, count_this, by_this, function='unique'):
    """
    Plot counts by column
    """
    
    if function == 'unique':
        grouping = d.groupby(by_this).nunique()
    elif function == 'count':
        grouping = d.groupby(by_this).count()
    
    grouping['left'] = range(0, len(grouping))
    grouping['right'] = grouping['left'].values + 1 

    tools = [HoverTool(
                        tooltips=[
                            (f"{by_this}", f"@{by_this}"),
                            (f"Num {count_this}'s", f"@{count_this}"),]
                      )
            ]

    p = figure(title=f"Count of Number of {count_this}'s by {by_this}", tools=tools, 
               plot_height=500, plot_width=700, background_fill_color="#fafafa",
              x_range=list(grouping.index))


    p.quad(source=ColumnDataSource(grouping[[count_this, 'left', 'right']]), 
           top=count_this, bottom=0, left='left', right='right',
           line_color="white", alpha=0.7)
    
    p.y_range.start = 0
    p.xaxis.axis_label = f'{by_this}'
    p.yaxis.axis_label = f"Num {count_this}'s"
    p.grid.grid_line_color="white"
    p.xaxis.major_label_orientation = math.pi/3
    show(p)
In [4]:
def hist_chart(d, count_this, for_each):
    
    num = d.groupby(by=for_each).nunique()[count_this]
    
    _left = [x for x in range(1, max(num))]
    _right = list(np.array(_left) + 1)
    
    hist, _ = np.histogram(num, bins=len(_left))

    tools = [HoverTool(
                        tooltips=[
                            (f"This many '{for_each}'", "@top"),
                            (f"Have this many '{count_this}'", "@left"),
                            ]
                      )
            ]
    p = figure(title=f"Count of Number of '{count_this}'", tools=tools, 
               plot_height=300, plot_width=700, background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=_left, right=_right, line_color="white", alpha=0.7) #, fill_color="navy", )
    p.y_range.start = 0
    p.xaxis.axis_label = f"Num '{count_this}'"
    p.yaxis.axis_label = f"Num '{for_each}'"
    p.grid.grid_line_color="white"
    show(p)

Matplotlib Helper Functions

In [5]:
def plot_hist_comps(df, metric_1, metric_2, y_log=False, bins=20):
    """Plots side-by-side histograms for comparison with log yscale option 
    """
    metrics_list = [metric_1, metric_2]
    metrics_str = [
        metric.replace('_', ' ').upper() for metric in metrics_list
    ]
    
    fig, ax = plt.subplots(1, 2, sharey=True, figsize=(12, 4))
    
    plt.suptitle(
        'Projects by {} and {}'.format(*metrics_str),
        fontsize=18
    )
    
    for (i, ax), metric_col, metric_name in zip(enumerate(ax), metrics_list, metrics_str):
        ax.hist(df[metric_col], bins=bins, alpha=0.7)
        ax.set_xlabel(metric_name, fontsize=14)
        ax.grid(':', alpha=0.4)
        if i==0:
            ax.set_ylabel('frequency', fontsize=12)
        if y_log:
            ax.set_yscale('log')
            if i==0:
                ax.set_ylabel('frequency (log scale)', fontsize=12)
            
    
    plt.tight_layout(rect=[0, 0.03, 1, .94])
    plt.show()
In [6]:
def plot_change_trend(trend_data, pid_data, pid):
    """Plots 4 subplots showing project budget and duration forecast change trend
    """
    # sets default for converting datetimes in matplotlib
    from pandas.plotting import register_matplotlib_converters
    from matplotlib.dates import YearLocator, DateFormatter
    register_matplotlib_converters()
    
    years = YearLocator()
    years_fmt = DateFormatter('%Y')
    
    def set_date_axis(ax, years, years_fmt):
        ax.xaxis.set_major_locator(years)
        ax.xaxis.set_major_formatter(years_fmt)
    
    fig, ax = plt.subplots(2,2, sharex=True, figsize=(12,6))
    
    pid_record = pid_data.loc[pid_data['PID']==pid]
    pid_changes = trend_data.loc[trend_data['PID']==pid]

    project_duration = pid_record['Original_Duration'].values[0] + \
                       np.cumsum(pid_changes['Latest_Schedule_Changes'].values)

    plt.suptitle(
        'PID {}: {}\nCategory: {}\nBorough: {}\ninitial duration: {:,.0f} days'.format(
            pid,
            pid_record['Project_Name'].values[0][:72],
            pid_record['Category'].values[0],
            pid_record['Borough'].values[0],
            pid_record['Original_Duration'].values[0]
        ), fontsize=16
    )

    # plot budget forecast
    ax[0,0].plot(
        pid_changes['Date_Reported_As_Of'], pid_changes['Budget_Forecast']/1e7, 'ko-'
    )
    ax[0,0].set_title('Total budget forecast')
    ax[0,0].set_ylabel('USD (millions)')

    # plot budget forecast percent change
    ax[1,0].plot(
        pid_changes['Date_Reported_As_Of'],
        ((pid_changes['Latest_Budget_Changes'])/
        (pid_changes['Budget_Forecast']-pid_changes['Latest_Budget_Changes']))*100,
        'ko-'
    )
    ax[1,0].axhline(0, color='k', linestyle=':')
    ax[1,0].set_title('Percentage budget change')
    ax[1,0].set_ylabel('percent change')

    ax[1,0].set_xlabel('project change date')

    # plot duration forecast
    ax[0,1].plot(
        pid_changes['Date_Reported_As_Of'], project_duration/1e3, 'ko-'
    )
    ax[0,1].set_title('Total forecasted project duration')
    ax[0,1].set_ylabel('days (thousands)')

    # plot duration change
    ax[1,1].plot(
        pid_changes['Date_Reported_As_Of'],
        (pid_changes['Latest_Schedule_Changes'] /
        (project_duration - pid_changes['Latest_Schedule_Changes']))*100,
        'ko-'
    )
    ax[1,1].axhline(0, color='k', linestyle=':')
    ax[1,1].set_title('Percentage duration change')
    ax[1,1].set_ylabel('percent change')
    
    ax[1,1].set_xlabel('project change date')
    
    for a in ax.flat:
        a.grid(':', alpha=0.4)
        set_date_axis(a, years, years_fmt)
    
    plt.tight_layout(rect=[0, 0.03, 1, .81])
    plt.show()

Read Dataset

Read the dataset and perform basic manipulation of headers and some fields including formatting.

index

In [7]:
file_path = '../data/Capital_Projects.csv'
if os.path.isfile(file_path):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'file_path' and ensure it points to the source file.")
OK - path points to file.
In [8]:
data = pd.read_csv(file_path)
In [9]:
# entries
print(f"Number of dataset records: {len(data)}")

# num projects
print(f"Number of unique projects in dataset: {len(data['PID'].unique())}")
Number of dataset records: 2259
Number of unique projects in dataset: 378

Data Cleanup

The dataset includes a sizable number of NaN values. The handling for each of them is different depending on the circumstances. This section includes the steps necessary to clean the data and remove NaN values.

index

In [11]:
# Use 'underscores' in column headers instead of spaces
old_cols = list(data.columns)
new_cols = [s.replace(' ', '_') for s in old_cols]
rename_dict = {k:v for k,v in zip(old_cols, new_cols)}
data.rename(columns=rename_dict, inplace=True)

# Change Date fields to date-type
data['Date_Reported_As_Of'] = pd.to_datetime(data['Date_Reported_As_Of'])
data['Design_Start'] = pd.to_datetime(data['Design_Start'])
data['Forecast_Completion'] = pd.to_datetime(data['Forecast_Completion'])
In [12]:
# make sure data is sorted properly
data = data.sort_values(by=['PID', 'Date_Reported_As_Of'])

"Missing-ness"

index

To find "weird" partial missing patterns we followed these steps:

  • Ignore the "Latest_Budget_Changes" and "Latest_Schedule_Changes" since we know these are frequently missing partially
  • Find out whether the project has any missing columns
  • See if the missing column is missing in entirety or just a couple of rows
In [44]:
weird_missing_pids = []
data_missing = data.drop(columns = ["Latest_Budget_Changes", "Latest_Schedule_Changes"])

for pid, df in data_missing.groupby("PID"):
    has_missing_columns = df.isnull().any().any()
    if has_missing_columns:
        missing_columns = df.columns[df.isnull().any()]
        for col in missing_columns:
            entire_col_missing = df[col].isnull().all()
            if not entire_col_missing:
                weird_missing_pids.append(pid)
                break
logging.warning(f"Found weird missing pattern for {len(weird_missing_pids)} projects")       
WARNING:root:Found weird missing pattern for 97 projects
In [50]:
## Print all weird missing pids

# for pid, df in data_missing.query("PID in @weird_missing_pids").groupby("PID"):
#     print(f"Missing patern for project {pid}")
#     msno.matrix(df)
#     plt.show()
In [49]:
missing_example_pids = [470, 546, 534, 553, 959]
for pid, df in data_missing.query("PID in @missing_example_pids").groupby("PID"):
    print(f"Missing patern for project {pid}")
    msno.matrix(df)
    plt.show()
Missing patern for project 470
Missing patern for project 534
Missing patern for project 546
Missing patern for project 553
Missing patern for project 959

Continue Data Cleanup

In [12]:
# rename phases to indicate thier sequence
rename_phases = {
                 'IT':'0-IT',
                 'Scoping/Planning':'1-Scoping/Planning',
                 'Design':'2-Design',
                 'Construction Procurement':'3-Construction Procurement',
                 'Construction':'4-Construction',
                 'Close-Out':'5-Close-Out'}
data.Current_Phase = data.Current_Phase.map(rename_phases)
In [13]:
# Create an original budget column
# The first line in any project is the source of the original budget.
# The original budget for the entire project can be calculated from the
# first line of any project:
# Original_Budget = Budget_Forecast - Latest_Budget_Changes

# if 'Latest_Budget_Changes' is NaN, make it 0 (this is the best info we have)
data['Latest_Budget_Changes'] = data['Latest_Budget_Changes'].fillna(0)

# loop through projects - create dict of original_budget values
orig_budgets_dict = {}
for PID in data['PID'].unique():
    df = data[data['PID']==PID].sort_values(by='Date_Reported_As_Of')
    
    original_budget = df.iloc[0]['Budget_Forecast']-df.iloc[0]['Latest_Budget_Changes']
    orig_budgets_dict.update({PID: original_budget} )
In [14]:
# Several Projects need manual updates because of poor data quality
for k,v in orig_budgets_dict.items():
    if math.isnan(v):
        print(k, v)
569 nan
595 nan
598 nan
600 nan
696 nan
697 nan
707 nan
800 nan
810 nan
870 nan
898 nan
899 nan
900 nan
901 nan
932 nan
934 nan
In [15]:
orig_budgets_dict.update({569: 28932000.00})
orig_budgets_dict.update({595: 0})
orig_budgets_dict.update({598: 0})
orig_budgets_dict.update({600: 0})
orig_budgets_dict.update({696: 0})
orig_budgets_dict.update({697: 0})
orig_budgets_dict.update({707: 0})
orig_budgets_dict.update({800: 0})
orig_budgets_dict.update({810: 0})
orig_budgets_dict.update({870: 0})
orig_budgets_dict.update({898: 27500000})
orig_budgets_dict.update({899: 27500000})
orig_budgets_dict.update({900: 27400000})
orig_budgets_dict.update({901: 27500000})
orig_budgets_dict.update({932: 0})
orig_budgets_dict.update({934: 0})
In [16]:
# apply the original budget values to each project
# create columns series and add it to the dataframe
def update_budget(s):
    return orig_budgets_dict.get(s.PID)

data.insert(10, 'Original_Budget', data.apply(update_budget, axis=1))
In [17]:
# Check that no values are NaN
data[data['Original_Budget'].isna()]
Out[17]:
Date_Reported_As_Of PID Project_Name Description Category Borough Managing_Agency Client_Agency Current_Phase Design_Start Original_Budget Budget_Forecast Latest_Budget_Changes Total_Budget_Changes Forecast_Completion Latest_Schedule_Changes Total_Schedule_Changes
In [18]:
# Where text descriptions are missing use 'not_specified'
data['Description'] = data['Description'].fillna('not_specified')
data['Borough'] = data['Borough'].fillna('not_specified')
data['Current_Phase'] = data['Current_Phase'].fillna('not_specified')
data['Client_Agency'] = data['Client_Agency'].fillna('not_specified')
In [19]:
# several projects have final lines that have no information - remove them
drop_idxs = data[(data['Current_Phase']=='not_specified') & (data['Design_Start'].isna()) & (data['Latest_Schedule_Changes'].isna()) & (data['Budget_Forecast'].isna()) & (data['Latest_Budget_Changes']==0)].index
data = data.drop(index=drop_idxs)
In [20]:
# NaN values per column
data.isna().sum()
Out[20]:
Date_Reported_As_Of          0
PID                          0
Project_Name                 0
Description                  0
Category                     0
Borough                      0
Managing_Agency              0
Client_Agency                0
Current_Phase                0
Design_Start                59
Original_Budget              0
Budget_Forecast              1
Latest_Budget_Changes        0
Total_Budget_Changes        53
Forecast_Completion         17
Latest_Schedule_Changes    288
Total_Schedule_Changes      61
dtype: int64

Calculate Missing NA's where Possible

In [21]:
# Fix entries where Total_Budget_Changes is nan

# Can be calculated as the sum of all 'Latest_Budget_Changes'

# loop through projects - create dict of 'Total_Budget_Changes' values
ttl_budget_changes_dict = {}
for PID in data['PID'].unique():
    df = data[data['PID']==PID].sort_values(by='Date_Reported_As_Of')
    
    # if no nan's, just take the last ttl value
    if len(df['Total_Budget_Changes'].isna()) == 0:
        ttl_budget_changes_dict.update({PID: df.iloc[-1]['Total_Budget_Changes']})
    else: # there is at least one nan
        # if they are all nan - calc as the sum of 'Latest_Budget_Changes'
        if data[data['PID']==PID]['Total_Budget_Changes'].isna().all():
            ttl_budget_changes_dict.update({PID: df['Latest_Budget_Changes'].sum()})
        
        else: # not all are nan, take the max of 'Total_Budget_Changes'
            ttl_budget_changes_dict.update({PID: df['Total_Budget_Changes'].max()})

# apply the total budget changes values to each project
# create columns series and add it to the dataframe
def update_changes(s):
    return ttl_budget_changes_dict.get(s.PID)

data['Total_Budget_Changes'] = data.apply(update_changes, axis=1)
In [22]:
# Update/Correct Latest_Schedule_Changes that have nan values

# if the first entry has a nan, make it 0

# loop through projects - create dict of 'Total_Budget_Changes' values
latest_schedule_changes_list = []

new_PID = False
prev_row = None
for i, row in data.iterrows():
    
    # see if this is the first entry for a project
    try:
        if row.PID != prev_row.PID:
            new_PID = True
    except:
        new_PID = True
    
    x = row['Latest_Schedule_Changes']
    
    # if nan and first entry, make 0
    if pd.isna(row['Latest_Schedule_Changes']) and new_PID:
        x = 0
        
    # if nan and not first row, calculate 
    elif pd.isna(row['Latest_Schedule_Changes']) and not new_PID:
        
        # if forecast_completion is nan, we can back into the value
        if pd.isna(row['Forecast_Completion']) or pd.isna(prev_row['Forecast_Completion']):
            
            
            # if the changes are all accounted for, make the entry 0
            if row['Total_Schedule_Changes'] - data[data.PID == row.PID]['Latest_Schedule_Changes'].sum() == 0:
                x = 0
            
            # see if this is the only NA row, we can back into the value
            elif data[data.PID == row.PID]['Latest_Schedule_Changes'].isna().sum() == 1:
                x = row['Total_Schedule_Changes'] - data[data.PID == row.PID]['Latest_Schedule_Changes'].sum()
            
            # othewise, can't calculate
                
        else: # there is a forecast_completion date and we can easily calculte the change
            x = row['Forecast_Completion'] - prev_row['Forecast_Completion']

    latest_schedule_changes_list.append(x)
    
    new_PID = False
    prev_row = row.copy()

data['Latest_Schedule_Changes'] = latest_schedule_changes_list
    
In [23]:
# this project is a lost cause
data[data['Latest_Schedule_Changes'].isna()]
Out[23]:
Date_Reported_As_Of PID Project_Name Description Category Borough Managing_Agency Client_Agency Current_Phase Design_Start Original_Budget Budget_Forecast Latest_Budget_Changes Total_Budget_Changes Forecast_Completion Latest_Schedule_Changes Total_Schedule_Changes
2084 2019-01-01 863 Network Infrastructure Refresh Program This project has a two-fold basis: 1) The ongo... Health and Hospitals, IT not_specified HHC H+H 0-IT NaT 0.00 0.00 0.00 160,090,000.00 NaT nan nan
2085 2019-05-01 863 Network Infrastructure Refresh Program This project has a two-fold basis: 1) The ongo... Health and Hospitals, IT not_specified HHC H+H 0-IT NaT 0.00 160,090,000.00 160,090,000.00 160,090,000.00 NaT nan nan
2086 2019-09-01 863 Network Infrastructure Refresh Program This project has a two-fold basis: 1) The ongo... Health and Hospitals, IT not_specified HHC H+H 0-IT NaT 0.00 160,090,000.00 0.00 160,090,000.00 2023-12-31 nan nan
In [24]:
# Now, calculate any missing Forecast_Completion values

fc_completion_list = []

new_PID = False
prev_row = None
for i, row in data.iterrows():
    
    # see if this is the first entry for a project
    try:
        if row.PID != prev_row.PID:
            new_PID = True
    except:
        new_PID = True
        
    x = row['Forecast_Completion']
    
    if pd.isna(x) and not new_PID:
        # we can back into the date
        x = prev_row['Forecast_Completion'] + timedelta(int(row['Latest_Schedule_Changes']))
        
    fc_completion_list.append(x)
        
    
data['Forecast_Completion'] = fc_completion_list
In [25]:
# now calculate any missing Total_Schedule_Changes

ttl_fc_changes_dict = {}
for PID in data['PID'].unique():
    df = data[data['PID']==PID].sort_values(by='Date_Reported_As_Of')
    
    # if no nan's, just take the last ttl value
    if len(df['Total_Schedule_Changes'].isna()) == 0:
        ttl_fc_changes_dict.update({PID: df.iloc[-1]['Total_Schedule_Changes']})
    else: # there is at least one nan
        # if they are all nan - calc as the sum of 'Latest_Schedule_Changes'
        if data[data['PID']==PID]['Total_Schedule_Changes'].isna().all():
            ttl_fc_changes_dict.update({PID: df['Latest_Schedule_Changes'].sum()})
        
        else: # not all are nan, take the max of 'Total_Schedule_Changes'
            ttl_fc_changes_dict.update({PID: df['Total_Schedule_Changes'].max()})


def update_changes(s):
    return ttl_fc_changes_dict.get(s.PID)

data['Total_Schedule_Changes'] = data.apply(update_changes, axis=1)
In [26]:
# Create an original schedule column

# in the first line of a project, back into the origonal date
# this is the best that we can do

try:
    data.drop(columns=['Original_Schedule'], inplace=True)
except:
    pass

orig_sched_dict = {}

for PID in data['PID'].unique():
    df = data[data['PID']==PID]
    
    x = df.iloc[0]['Forecast_Completion'] - timedelta(days=int(df.iloc[0]['Latest_Schedule_Changes']))
    
    # if the first record couldn't be used, use the last record
    if pd.isna(x):
        x = df.iloc[-1]['Forecast_Completion']- timedelta(days=int(df.iloc[-1]['Total_Schedule_Changes']))
        
    
    orig_sched_dict.update({PID: x})
    

def update_schedule(s):
    return orig_sched_dict.get(s.PID)

data.insert(14, 'Original_Schedule', data.apply(update_schedule, axis=1))
    
    
In [27]:
data.isna().sum()
Out[27]:
Date_Reported_As_Of         0
PID                         0
Project_Name                0
Description                 0
Category                    0
Borough                     0
Managing_Agency             0
Client_Agency               0
Current_Phase               0
Design_Start               59
Original_Budget             0
Budget_Forecast             1
Latest_Budget_Changes       0
Total_Budget_Changes        0
Original_Schedule           5
Forecast_Completion        17
Latest_Schedule_Changes     3
Total_Schedule_Changes      0
dtype: int64
In [28]:
# All projects in 'Scoping/Planning' phase have no scheduled date yet. Make themm = 0.
x_idx = data[data['Current_Phase']=='1-Scoping/Planning'].index
for i in x_idx:
    data.loc[i] = data.loc[i].fillna(0)
In [29]:
data.isna().sum()
Out[29]:
Date_Reported_As_Of         0
PID                         0
Project_Name                0
Description                 0
Category                    0
Borough                     0
Managing_Agency             0
Client_Agency               0
Current_Phase               0
Design_Start               55
Original_Budget             0
Budget_Forecast             1
Latest_Budget_Changes       0
Total_Budget_Changes        0
Original_Schedule           1
Forecast_Completion        13
Latest_Schedule_Changes     3
Total_Schedule_Changes      0
dtype: int64

Cleanup is Complete!

Counts

index

How many updates do we have for each project?

In [30]:
hist_chart(d=data, count_this='Date_Reported_As_Of', for_each='PID')

How many updates do we have for each project Category?

In [137]:
px.histogram(data, x="Date_Reported_As_Of", color ="Category", barmode="group")
In [151]:
px.histogram(data, x="Date_Reported_As_Of", color ="Category", barmode="group", facet_col="Managing_Agency", facet_col_wrap = 4)

Distribution of projects by categories

In [31]:
count_chart(d=data, count_this='PID', by_this='Category')

Distribution of projects by Boroughs

In [32]:
count_chart(d=data, count_this='PID', by_this='Borough')
In [33]:
count_chart(d=data, count_this='PID', by_this='Managing_Agency')
In [34]:
count_chart(d=data, count_this='PID', by_this='Client_Agency')
In [35]:
count_chart(d=data, count_this='PID', by_this='Current_Phase')
In [36]:
# updates per phase
count_chart(d=data, count_this='Date_Reported_As_Of', by_this='Current_Phase', function='count')
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

index

In [37]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2154 entries, 0 to 2258
Data columns (total 18 columns):
Date_Reported_As_Of        2154 non-null datetime64[ns]
PID                        2154 non-null int64
Project_Name               2154 non-null object
Description                2154 non-null object
Category                   2154 non-null object
Borough                    2154 non-null object
Managing_Agency            2154 non-null object
Client_Agency              2154 non-null object
Current_Phase              2154 non-null object
Design_Start               2099 non-null object
Original_Budget            2154 non-null float64
Budget_Forecast            2153 non-null float64
Latest_Budget_Changes      2154 non-null float64
Total_Budget_Changes       2154 non-null float64
Original_Schedule          2153 non-null object
Forecast_Completion        2141 non-null object
Latest_Schedule_Changes    2151 non-null float64
Total_Schedule_Changes     2154 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(1), object(10)
memory usage: 399.7+ KB

Cleanse datetime columns for time series plotting

In [38]:
# identify records with '0' value 'Design_Start', causing datetime conversion errors
drop_PID = list(data.loc[data['Design_Start']==0]['PID'].values)
n_drop_PID_records = len(data.loc[data['Design_Start']==0]['PID'])
n_records = len(data)

# print summary of records
print(
    'The following projects have zero-valued "Design_Start" dates and '\
    'will be dropped, resulting in {} total records dropped:\n\n\t{}\n'\
    ''.format(n_drop_PID_records, drop_PID)
)

# drop identified PID records
data = data.copy().loc[~data['PID'].isin(drop_PID)]

print(
    'The resulting dataframe, now contains {:,} records'.format(len(data))
)
The following projects have zero-valued "Design_Start" dates and will be dropped, resulting in 4 total records dropped:

	[374, 791, 794, 960]

The resulting dataframe, now contains 2,150 records
In [39]:
# re-establish datetime columns as correct type
date_cols = [
    'Date_Reported_As_Of',
    'Forecast_Completion',
    'Design_Start',
    'Original_Schedule'
]

for col in date_cols:
    data[col] = pd.to_datetime(data[col])

# print summary stats for datetime cols
data[date_cols].describe()
Out[39]:
Date_Reported_As_Of Forecast_Completion Design_Start Original_Schedule
count 2150 2137 2095 2149
unique 15 1019 341 298
top 2019-05-01 00:00:00 2021-09-03 00:00:00 2004-06-30 00:00:00 2021-09-03 00:00:00
freq 296 58 42 59
first 2013-09-01 00:00:00 2014-07-01 00:00:00 1993-08-02 00:00:00 2014-07-01 00:00:00
last 2019-09-01 00:00:00 2108-10-21 00:00:00 2019-09-30 00:00:00 2028-12-04 00:00:00
In [40]:
# fix invalid year "2108", assuming it was a keying error
data.loc[data['Forecast_Completion']=='2108-10-21', 'Forecast_Completion']  = pd.Timestamp('2018-10-21')

# print summary stats for datetime cols once more
data[date_cols].describe()
Out[40]:
Date_Reported_As_Of Forecast_Completion Design_Start Original_Schedule
count 2150 2137 2095 2149
unique 15 1018 341 298
top 2019-05-01 00:00:00 2021-09-03 00:00:00 2004-06-30 00:00:00 2021-09-03 00:00:00
freq 296 58 42 59
first 2013-09-01 00:00:00 2014-07-01 00:00:00 1993-08-02 00:00:00 2014-07-01 00:00:00
last 2019-09-01 00:00:00 2036-01-14 00:00:00 2019-09-30 00:00:00 2028-12-04 00:00:00
In [41]:
# verify resuling data types
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2150 entries, 0 to 2258
Data columns (total 18 columns):
Date_Reported_As_Of        2150 non-null datetime64[ns]
PID                        2150 non-null int64
Project_Name               2150 non-null object
Description                2150 non-null object
Category                   2150 non-null object
Borough                    2150 non-null object
Managing_Agency            2150 non-null object
Client_Agency              2150 non-null object
Current_Phase              2150 non-null object
Design_Start               2095 non-null datetime64[ns]
Original_Budget            2150 non-null float64
Budget_Forecast            2149 non-null float64
Latest_Budget_Changes      2150 non-null float64
Total_Budget_Changes       2150 non-null float64
Original_Schedule          2149 non-null datetime64[ns]
Forecast_Completion        2137 non-null datetime64[ns]
Latest_Schedule_Changes    2147 non-null float64
Total_Schedule_Changes     2150 non-null float64
dtypes: datetime64[ns](4), float64(6), int64(1), object(7)
memory usage: 319.1+ KB

Generate dataframe of unique project records with associated metrics

In [47]:
# convert dataframe single record for each PID
df_pid = data.groupby('PID').agg(
    {
        'Project_Name': 'first',
        'Category': 'first',
        'Borough': 'first',
        'Managing_Agency': 'first',
        'Client_Agency': 'first',
        'Date_Reported_As_Of': 'count',
        'Original_Budget': 'first',
        'Design_Start': min,
        'Original_Schedule': 'first',
        'Forecast_Completion': 'last',
        'Total_Schedule_Changes': max,
        'Total_Budget_Changes': max
    }
).reset_index()

# rename column for number of changes
df_pid = df_pid.rename(columns={'Date_Reported_As_Of': 'Number_Changes'})

# define original forecasted project duration
df_pid['Original_Duration'] = (df_pid['Original_Schedule'] - df_pid['Design_Start']).dt.days

# define budget change ratio
df_pid['Budget_Change_Ratio'] = df_pid['Total_Budget_Changes']/df_pid['Original_Budget']

# define schedule change ratio
df_pid['Schedule_Change_Ratio'] = df_pid['Total_Schedule_Changes']/df_pid['Original_Duration']

Drop projects that are missing critical date and budget metrics

In [48]:
# Drop projects with missing or zero values for each of the following columns
drop_filter_cols = [
    'Original_Budget',
    'Design_Start',
    'Original_Schedule'
]

# define df.loc filter to simplify list comprehension and drop functions
loc_filter = lambda df, col: (df[col]==0) | (df[col].isnull())

# store list of tuples identify the PID and corresponding category of each to be dropped
drop_pids = sum(
    [
        list(
            zip(
                [*df_pid.loc[loc_filter(df_pid, col)]['PID'].values],
                [*df_pid.loc[loc_filter(df_pid, col)]['Category'].values]
            )
        ) for col in drop_filter_cols
    ], []
)

# print summary of PIDs to be dropped
print(
    'The following PIDs will be dropped because they have missing or '\
    'zero values for the columns:\n\n\t{}\n\nPID\tCategory'.format(drop_filter_cols)
)
for pid in drop_pids:
    print('{}\t{}'.format(*pid))

# drop PIDs from dataframe
for col in drop_filter_cols:
    df_pid = df_pid.copy().loc[~loc_filter(df_pid, col)]

# print summary of remaining PIDs
print(
    '\nThe resulting dataframe contains {} unique projects\n'.format(len(df_pid))
)
The following PIDs will be dropped because they have missing or zero values for the columns:

	['Original_Budget', 'Design_Start', 'Original_Schedule']

PID	Category
863	Health and Hospitals, IT
910	IT
267	IT
403	Health and Hospitals, IT
553	IT
723	Health and Hospitals, IT
863	Health and Hospitals, IT
910	IT
911	Public Safety and Criminal Justice
1006	Industrial Development

The resulting dataframe contains 355 unique projects

Examine relationships between metrics by project

In [49]:
# generate scatter matrix with each quantitative metric
scatter_cols = [
    'Number_Changes',
    'Original_Budget',
    'Total_Budget_Changes',
    'Budget_Change_Ratio',
    'Original_Duration',
    'Total_Schedule_Changes',
    'Schedule_Change_Ratio'
]

g = sns.pairplot(df_pid[scatter_cols], plot_kws={'alpha': 0.3, 's': 100})
g.fig.suptitle('Distribution of budget and duration changes by project', fontsize=22, y=1.01)
plt.show();
In [50]:
plot_hist_comps(df_pid, 'Original_Budget', 'Original_Duration', y_log=True)
plot_hist_comps(df_pid, 'Total_Budget_Changes', 'Total_Schedule_Changes', y_log=True)
plot_hist_comps(df_pid, 'Budget_Change_Ratio', 'Schedule_Change_Ratio', y_log=True)
In [51]:
def identify_edge_pids(df, metric_col, n_pids=2,
                       change_threshold=5, change_col='Number_Changes'):
    """Identifies PIDs with minimum and maximum values for any given metric
    """
    loc_filter = df[change_col]>=change_threshold
    max_pids = list(
        df.loc[loc_filter].sort_values(by=metric_col, ascending=True)[-n_pids:]['PID']
    )
    min_pids = list(
        df.loc[loc_filter].sort_values(by=metric_col, ascending=True)[:n_pids]['PID']
    )
    
    return {'min': min_pids, 'max': max_pids}
In [52]:
metric_col_list = [
    'Original_Budget',
    'Original_Duration',
    'Budget_Change_Ratio',
    'Schedule_Change_Ratio',
]

change_threshold = 5

edge_pids_dict = {
    metric_col: identify_edge_pids(
        df_pid, metric_col, n_pids=2, change_threshold=change_threshold
    )
    for metric_col in metric_col_list
}

print(
    'Projects with {} or more change records, and the largest and '\
    'smallest metric values are as follows:\n'.format(change_threshold)
)

for key, value in edge_pids_dict.items():
    print('{}\n'.format(key))
    for edge, pid_list in value.items():
        print('{}:\t{}'.format(edge, pid_list))
    print()
Projects with 5 or more change records, and the largest and smallest metric values are as follows:

Original_Budget

min:	[603, 784]
max:	[555, 416]

Original_Duration

min:	[480, 507]
max:	[96, 112]

Budget_Change_Ratio

min:	[537, 594]
max:	[603, 784]

Schedule_Change_Ratio

min:	[482, 690]
max:	[480, 193]

Time series plots for projects with extreme project changes or extreme scope values

In [53]:
pid_plot_list = [
    603, 555, 480, 96, 594, 574, 482
]

for pid in pid_plot_list:
    plot_change_trend(data, df_pid, pid)
    print()







Budget Values

index

Stationary budget values (originally forecasted vs to date)

Metrics calculated:

  • Absolute percentage error: |Forecast − Actual|/Actual
  • Relative percentage error: (referred to as cost escalation in project estimation literature): |Forecast − Actual|/Forecast:
  • Ratios: Forecast/Actual or Actual/Forecast Difference: Actual − Forecast
  • Days delayed (only the positive differences) Absolute error: |Forecast − Actual| Squared error: (Forecast − Actual)2
In [59]:
df_pid = df_pid.eval('Budget_To_Date = Original_Budget + Total_Budget_Changes')\
            .eval('Budget_Abs_Per_Error = abs(Original_Budget - Budget_To_Date)/Budget_To_Date')\
            .eval('Budget_Rel_Per_Error = abs(Original_Budget - Budget_To_Date)/Original_Budget')\
            .eval('Budget_Ratios = Budget_To_Date/Original_Budget')
df_pid
Out[59]:
PID Project_Name Category Borough Managing_Agency Client_Agency Number_Changes Original_Budget Design_Start Original_Schedule Forecast_Completion Total_Schedule_Changes Total_Budget_Changes Original_Duration Budget_Change_Ratio Schedule_Change_Ratio Budget_To_Date Budget_Abs_Per_Error Budget_Rel_Per_Error Budget_Ratios
0 3 26th Ward Waste Water Treatment Plant Prelimin... Wastewater Treatment Brooklyn DEP DEP 13 184,896,360.00 2013-09-23 2020-01-13 2020-10-09 270.00 -4,318,643.37 2,303.00 -0.02 0.12 180,577,716.63 0.02 0.02 0.98
1 7 Bowery Bay Waste Water Treatment Plant Main Se... Wastewater Treatment Queens DEP DEP 15 53,493,000.00 2013-05-01 2018-03-31 2020-02-14 685.00 15,305,457.00 1,795.00 0.29 0.38 68,798,457.00 0.22 0.29 1.29
2 18 Croton Falls Pumping Station Construction Water Supply Carmel DEP DEP 15 66,616,435.00 2003-08-22 2017-05-31 2020-03-10 1,014.00 1,255,175.02 5,031.00 0.02 0.20 67,871,610.02 0.02 0.02 1.02
3 25 Gowanus Facilities Upgrade Wastewater Treatment Brooklyn DEP DEP 15 198,134,067.00 2001-11-01 2014-08-05 2019-12-31 1,974.00 25,233,101.11 4,660.00 0.13 0.42 223,367,168.11 0.11 0.13 1.13
4 34 Jamaica Waste Water Treatment Plant Upgrades Wastewater Treatment Queens DEP DEP 15 463,790,739.00 1996-05-01 2014-07-01 2019-12-31 2,009.00 -14,632,471.11 6,635.00 -0.03 0.30 449,158,267.89 0.03 0.03 0.97
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
358 1010 PRIOR NOTICE SIDEWALKS - BROOKLYN Streets and Roadways not_specified DDC DOT 1 29,427,000.00 2015-08-06 2020-05-13 2020-05-13 0.00 0.00 1,742.00 0.00 0.00 29,427,000.00 0.00 0.00 1.00
359 1011 RECONSTRUCTION OF ROSEDALE AREA, PHASE 2 Streets and Roadways not_specified DDC DOT 1 39,360,000.00 2018-06-07 2024-08-14 2024-08-14 0.00 0.00 2,260.00 0.00 0.00 39,360,000.00 0.00 0.00 1.00
360 1012 RECONSTRUCTION OF FRONT STREET, MANHATTAN Streets and Roadways not_specified DDC DOT 1 27,356,000.00 2015-07-31 2022-06-16 2022-06-16 0.00 0.00 2,512.00 0.00 0.00 27,356,000.00 0.00 0.00 1.00
361 1013 GRAND CONCOURSE, PHASE 5 Streets and Roadways not_specified DDC DOT 1 78,921,000.00 2019-09-16 2025-09-13 2025-09-13 0.00 0.00 2,189.00 0.00 0.00 78,921,000.00 0.00 0.00 1.00
362 1014 RECONSTRUCTION OF BROADWAY PHASE 2 Streets and Roadways not_specified DDC DOT 1 31,110,000.00 2019-08-22 2026-07-17 2026-07-17 0.00 0.00 2,521.00 0.00 0.00 31,110,000.00 0.00 0.00 1.00

355 rows × 20 columns

In [63]:
df_pid_melted = df_pid.melt(id_vars=set(df_pid.columns) - set(["Budget_Abs_Per_Error", "Budget_Rel_Per_Error", 'Budget_Ratios']), value_vars=["Budget_Abs_Per_Error", "Budget_Rel_Per_Error", 'Budget_Ratios'], var_name="metric", value_name="metric_value")
df_pid_melted
Out[63]:
Number_Changes Original_Duration Project_Name Borough Budget_Change_Ratio Original_Budget Schedule_Change_Ratio Managing_Agency Original_Schedule Category Design_Start Forecast_Completion Total_Budget_Changes PID Client_Agency Total_Schedule_Changes Budget_To_Date metric metric_value
0 13 2,303.00 26th Ward Waste Water Treatment Plant Prelimin... Brooklyn -0.02 184,896,360.00 0.12 DEP 2020-01-13 Wastewater Treatment 2013-09-23 2020-10-09 -4,318,643.37 3 DEP 270.00 180,577,716.63 Budget_Abs_Per_Error 0.02
1 15 1,795.00 Bowery Bay Waste Water Treatment Plant Main Se... Queens 0.29 53,493,000.00 0.38 DEP 2018-03-31 Wastewater Treatment 2013-05-01 2020-02-14 15,305,457.00 7 DEP 685.00 68,798,457.00 Budget_Abs_Per_Error 0.22
2 15 5,031.00 Croton Falls Pumping Station Construction Carmel 0.02 66,616,435.00 0.20 DEP 2017-05-31 Water Supply 2003-08-22 2020-03-10 1,255,175.02 18 DEP 1,014.00 67,871,610.02 Budget_Abs_Per_Error 0.02
3 15 4,660.00 Gowanus Facilities Upgrade Brooklyn 0.13 198,134,067.00 0.42 DEP 2014-08-05 Wastewater Treatment 2001-11-01 2019-12-31 25,233,101.11 25 DEP 1,974.00 223,367,168.11 Budget_Abs_Per_Error 0.11
4 15 6,635.00 Jamaica Waste Water Treatment Plant Upgrades Queens -0.03 463,790,739.00 0.30 DEP 2014-07-01 Wastewater Treatment 1996-05-01 2019-12-31 -14,632,471.11 34 DEP 2,009.00 449,158,267.89 Budget_Abs_Per_Error 0.03
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1060 1 1,742.00 PRIOR NOTICE SIDEWALKS - BROOKLYN not_specified 0.00 29,427,000.00 0.00 DDC 2020-05-13 Streets and Roadways 2015-08-06 2020-05-13 0.00 1010 DOT 0.00 29,427,000.00 Budget_Ratios 1.00
1061 1 2,260.00 RECONSTRUCTION OF ROSEDALE AREA, PHASE 2 not_specified 0.00 39,360,000.00 0.00 DDC 2024-08-14 Streets and Roadways 2018-06-07 2024-08-14 0.00 1011 DOT 0.00 39,360,000.00 Budget_Ratios 1.00
1062 1 2,512.00 RECONSTRUCTION OF FRONT STREET, MANHATTAN not_specified 0.00 27,356,000.00 0.00 DDC 2022-06-16 Streets and Roadways 2015-07-31 2022-06-16 0.00 1012 DOT 0.00 27,356,000.00 Budget_Ratios 1.00
1063 1 2,189.00 GRAND CONCOURSE, PHASE 5 not_specified 0.00 78,921,000.00 0.00 DDC 2025-09-13 Streets and Roadways 2019-09-16 2025-09-13 0.00 1013 DOT 0.00 78,921,000.00 Budget_Ratios 1.00
1064 1 2,521.00 RECONSTRUCTION OF BROADWAY PHASE 2 not_specified 0.00 31,110,000.00 0.00 DDC 2026-07-17 Streets and Roadways 2019-08-22 2026-07-17 0.00 1014 DOT 0.00 31,110,000.00 Budget_Ratios 1.00

1065 rows × 19 columns

Comparing 3 budget metrics

In [66]:
px.histogram(df_pid_melted,x="metric_value", facet_col="metric", color="Category", barmode="overlay" )

Stratify by category and boroughs

In [79]:
px.histogram(df_pid_melted,x="metric_value", facet_col="metric", facet_row= "Category", color="Borough", barmode="overlay" , height=5000)
In [124]:
fig = px.line(data.sort_values(['Date_Reported_As_Of', "PID"]), x='Date_Reported_As_Of', y="Latest_Budget_Changes", color= "Category", line_dash="PID", hover_name="PID", height=1000)
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig
In [122]:
fig = px.line(data.sort_values(['Date_Reported_As_Of', "PID"]), x='Date_Reported_As_Of', y="Latest_Budget_Changes", facet_col = "Borough", color= "Category", hover_name="PID", line_group="PID", facet_col_wrap=5, height= 2000)
fig

Stratify by managing agency

In [144]:
fig = px.line(data.sort_values(['Date_Reported_As_Of', "PID"]), x='Date_Reported_As_Of', y="Latest_Budget_Changes", facet_col = "Managing_Agency", color= "Category", hover_name="PID", line_group="PID", facet_col_wrap=5, height= 2000)
fig
In [145]:
fig = px.line(data.sort_values(['Date_Reported_As_Of', "PID"]), x='Date_Reported_As_Of', y="Latest_Budget_Changes", facet_col = "Client_Agency", color= "Category", hover_name="PID", line_group="PID", facet_col_wrap=5, height= 2000)
fig

Summary & Findings

index

Poor data quality.

Data is inconsistent in quality. Human-entered data will always have inconsistencies. We will need to work around or with that. Any model will need more data. We will need to source more project data from other regions and types of projects to increase the variety Lots of missing data. To feed a model, some data engineering will be necessary. The types of dat that need to be created are not yet understood. Any efforts to normalize data will need to consider normalization within the project as well as within the entire dataset.

Open Items / Unanswered

index